Prediction is difficult, especially about the future.
— Niels Bohr (apocryphally)
A time series is a special data set where each observation has an associated time measurement. There is a special R structure for storing and operating on time series, called ts, as illustrated here:
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
birthstimeseries <- ts(births, frequency=12, start=c(1946,1))
birthstimeseries
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1946 26.663 23.598 26.931 24.740 25.806 24.364 24.477 23.901 23.175 23.227
## 1947 21.439 21.089 23.709 21.669 21.752 20.761 23.479 23.824 23.105 23.110
## 1948 21.937 20.035 23.590 21.672 22.222 22.123 23.950 23.504 22.238 23.142
## 1949 21.548 20.000 22.424 20.615 21.761 22.874 24.104 23.748 23.262 22.907
## 1950 22.604 20.894 24.677 23.673 25.320 23.583 24.671 24.454 24.122 24.252
## 1951 23.287 23.049 25.076 24.037 24.430 24.667 26.451 25.618 25.014 25.110
## 1952 23.798 22.270 24.775 22.646 23.988 24.737 26.276 25.816 25.210 25.199
## 1953 24.364 22.644 25.565 24.062 25.431 24.635 27.009 26.606 26.268 26.462
## 1954 24.657 23.304 26.982 26.199 27.210 26.122 26.706 26.878 26.152 26.379
## 1955 24.990 24.239 26.721 23.475 24.767 26.219 28.361 28.599 27.914 27.784
## 1956 26.217 24.218 27.914 26.975 28.527 27.139 28.982 28.169 28.056 29.136
## 1957 26.589 24.848 27.543 26.896 28.878 27.390 28.065 28.141 29.048 28.484
## 1958 27.132 24.924 28.963 26.589 27.931 28.009 29.229 28.759 28.405 27.945
## 1959 26.076 25.286 27.660 25.951 26.398 25.565 28.865 30.000 29.261 29.012
## Nov Dec
## 1946 21.672 21.870
## 1947 21.759 22.073
## 1948 21.059 21.573
## 1949 21.519 22.025
## 1950 22.084 22.991
## 1951 22.964 23.981
## 1952 23.162 24.707
## 1953 25.246 25.180
## 1954 24.712 25.688
## 1955 25.693 26.881
## 1956 26.291 26.987
## 1957 26.634 27.735
## 1958 25.912 26.619
## 1959 26.992 27.897
This reads in a data set of number of births per month in New York City from 1946 to 1958 (it’s not clear what the units are - mabe thousands of births?) To create the time series, we had to give the function the frequency, or the number of time points in a year, and the starting value as a vector assigned to start=c(1946, 1), the first element is the year and the second the month.
Here are two different time series, of diabetic drug sales in Australia (in millions of AUS dollaors), also with monthly frequency, and of Boston marathon winning times, with yearly frequency:
#a10
#marathon
The most straighforward way of visualizing time series is using a time plot, which can be created using autoplot:
autoplot(birthstimeseries) +
ggtitle("Number of births in NYC") +
ylab("Births (thousands") +
xlab("Year")
autoplot(a10) +
ggtitle("Antidiabetic drug sales") +
ylab("AUS$ (million)") +
xlab("Year")
autoplot(marathon) +
ggtitle("Boston marathon winning times") +
ylab("Time (minutes)") +
xlab("Year")
Time series of course illustrate changes over time, and frequently we want to describe and account for these changes. In forrecasting, there are three types of systematic patters that have their own terminology (taken from [1])
Trend A trend exists when there is a long-term increase or decrease in the data. It does not have to be linear. Sometimes we will refer to a trend as “changing direction”, when it might go from an increasing trend to a decreasing trend. There is a trend in the antidiabetic drug sales data.
Seasonal A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known frequency. The monthly sales of antidiabetic drugs above shows seasonality which is induced partly by the change in the cost of the drugs at the end of the calendar year.
Cyclic A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency. These fluctuations are usually due to economic conditions, and are often related to the “business cycle”. The duration of these fluctuations is usually at least 2 years.
The following data set contains the number of visitors (visitor nights) on a quarterly basis for five regions of New South Wales, Australia:
autoplot(visnights[,1:5], facets=TRUE) +
ylab("Number of visitor nights each quarter (millions)")
One simple question is whether different variables are related to each other. One simple way is to calculate the Pearson correlation between different time series, called the cross-correlation (where \(\bar X\) stands for the mean of X and \(Var(X)\) stands for the variance of \(X\)):
\[ Cor(X,Y) = \frac{\sum_t (\bar X - X_t)(\bar Y - Y_t)}{\sqrt{Var(X)Var(Y)}} \]
In a data set with multiple variables it can be handy to examine the correlations between all pairs between them. Here’s a convenient function for that:
head(visnights)
## NSWMetro NSWNthCo NSWSthCo NSWSthIn NSWNthIn QLDMetro QLDCntrl
## 1998 Q1 9.047095 8.565678 5.818029 2.679538 2.977507 12.106052 2.748374
## 1998 Q2 6.962126 7.124468 2.466437 3.010732 3.477703 7.786687 4.040915
## 1998 Q3 6.871963 4.716893 1.928053 3.328869 3.014770 11.380024 5.343964
## 1998 Q4 7.147293 6.269299 2.797556 2.417772 3.757972 9.311460 4.260419
## 1999 Q1 7.956923 9.493901 4.853681 3.224285 3.790760 12.671942 4.186113
## 1999 Q2 6.542243 5.401201 2.759843 2.428489 3.395284 9.582965 4.237806
## QLDNthCo SAUMetro SAUCoast SAUInner VICMetro VICWstCo VICEstCo
## 1998 Q1 2.137234 2.881372 2.591997 0.8948773 7.490382 2.4420048 3.381972
## 1998 Q2 2.269596 2.124736 1.375780 0.9792509 5.198178 0.9605047 1.827940
## 1998 Q3 4.890227 2.284870 1.079542 0.9803289 5.244217 0.7559744 1.351952
## 1998 Q4 2.621548 1.785889 1.497664 1.5094343 6.274246 1.2716040 1.493415
## 1999 Q1 2.483203 2.293873 2.247684 0.9635227 9.187422 2.3850583 2.896929
## 1999 Q2 3.377830 2.197418 1.672802 0.9968803 4.992303 1.3288638 1.547901
## VICInner WAUMetro WAUCoast WAUInner OTHMetro OTHNoMet
## 1998 Q1 5.326655 3.075779 3.066555 0.6949954 3.437924 2.073469
## 1998 Q2 4.441119 2.154929 3.334405 0.5576796 2.677081 1.787939
## 1998 Q3 3.815645 2.787286 4.365844 1.0061844 3.793743 2.345021
## 1998 Q4 3.859567 2.752910 4.521996 1.1725514 3.304231 1.943689
## 1999 Q1 4.588755 3.519564 3.579347 0.3981829 3.510819 2.165838
## 1999 Q2 4.070401 3.160430 3.408533 0.5960182 2.871867 1.803940
GGally::ggpairs(as.data.frame(visnights[,1:5]))
A time series can be correlated against itself shifted in time by some set amount, also called lagged. We can plot the lagged correlations for different visitor
visnights[,1]
## Qtr1 Qtr2 Qtr3 Qtr4
## 1998 9.047095 6.962126 6.871963 7.147293
## 1999 7.956923 6.542243 6.330364 7.509212
## 2000 7.662491 6.341802 7.827301 9.579562
## 2001 8.270488 7.240427 6.640490 7.111875
## 2002 6.827826 6.404992 6.615760 7.226376
## 2003 7.589058 6.334527 5.996748 6.612846
## 2004 7.758267 6.778836 5.854452 6.200214
## 2005 7.163830 5.082204 5.673551 6.089906
## 2006 8.525916 6.569684 5.771059 6.692897
## 2007 8.158658 5.710082 5.402543 6.803494
## 2008 7.866269 5.616704 5.886764 5.506298
## 2009 6.787225 5.361317 4.699350 6.208784
## 2010 7.148262 4.850217 6.029490 6.238903
## 2011 7.597468 5.815930 6.183239 5.929030
## 2012 7.986931 5.307871 6.054112 6.023897
## 2013 7.028480 5.813450 6.322841 7.101691
## 2014 7.897316 5.997468 6.033533 7.103398
## 2015 8.725132 6.995875 6.294490 6.945476
## 2016 7.373757 6.792234 6.530568 7.878277
visnights_smaller <- window(visnights[,1], start=2000, end = 2010)
gglagplot(visnights_smaller)
Here the colors indicate the quarter of the variable on the vertical axis, compared with the shifted (lagged variable on the horizontal axis, and the lines connect points in chronological order. The relationship is strongly positive at lags 4 and 8, reflecting the strong seasonality in the data.
This suggests that there is a strong similarity between the time series and itself, shifted by certain time values. This is described by the autocorrelation, which is defined as a function of the lag \(k\):
$$ r(k) =
$$ This can be calculated and plotted for our example of the visitation nights in New South Wales:
ggAcf(visnights_smaller)
Notice the periodicity in the autocorrelation, which indicated periodicity in the time series. Let’s similarly calculate the autocorrelation of the drug sales data:
ggAcf(a10)
Notice how different this correlogram is - there are no zero values of autcorrelation, only slow decay with some small periodic components.
Autocorrelation measures the memory of a signal - for example, pure white noise is uncorrelated with itself even a moment later, and thus has no memory. As such, it is very useful as a measure of a trend in the data - if the time series has slowly decaying, positive autocorrelation, that indicates a pronounced trend, while periodicity indicates seasonality in the data.
Exercise: Use the lag and autocorrelation analysis to describe the patterns in the time series of births in NYC and in the Boston marathon winning times.
There are two main types of decompositions of time series: additive and multiplicative. Let us call \(X_t\) the time series data, \(T_t\) the trend (non-periodic component), \(S_t\) the seasonal part (periodic component), and \(R_t\) the remainder.
\[ X_t = T_t + S_t + R_t \] \[ X_t = T_t \times S_t \times R_t \]
One simple way of removing seasonality and estimating the trend is using the moving average, that is using \(k\) points before and \(k\) points after each point to calculate the trend: \[ T_t = \frac{1}{m} \sum_{i=-k}^k X_{t+i} \]
Here \(m\) is called the order of the moving average and is defined as \(m = 2k+1\). There is a useful function ma() that calculates these averages and allows them to be plotted.
m <- 3
s_name <- paste("MA-", m)
autoplot(a10, series = "data") +
autolayer(ma(a10, m), series = "MA") +
xlab("Year") + ylab("AUS $ (millions)") +
ggtitle("Anti-diabetic drug sales")
## Warning: Removed 2 rows containing missing values (geom_path).
Exercise: Change the moving average window and see if you can make seasonality vanish!
An even order of periodicity requires an asymmetric averaging window, so to create an symmetric average, one can repeat the moving average of order two on the already-averaged data.
Additive decomposition [1]:
a10 %>% decompose(type="additive") %>%
autoplot() + xlab("Year") +
ggtitle("Classical additive decomposition
of the antidiabetic drug sales data")
This simple classical decomposition has numerous flaws, so better, more modern methods are preferred. In particular, it assumes a constant seasonal term, it tends to over-estimate the variation in the trend, it misses data for the first few and last few data points, and can be sensitive to outliers.
A more robust method is called the STL decomposition (Seasonal and Trend decomposition using Loess). To summarize its advantanges [1]:
a10 %>% stl(t.window=13, s.window="periodic", robust=TRUE) %>%
autoplot()
Exercise: Apply the two decomposition methods to the Boston marathon and to births in NYC time series.
Let us analyze the data set of US quarterly economic data, specifically, the percent change in consumption, income, production, savigs, and unemployment.
head(uschange)
## Consumption Income Production Savings Unemployment
## 1970 Q1 0.6159862 0.9722610 -2.4527003 4.8103115 0.9
## 1970 Q2 0.4603757 1.1690847 -0.5515251 7.2879923 0.5
## 1970 Q3 0.8767914 1.5532705 -0.3587079 7.2890131 0.5
## 1970 Q4 -0.2742451 -0.2552724 -2.1854549 0.9852296 0.7
## 1971 Q1 1.8973708 1.9871536 1.9097341 3.6577706 -0.1
## 1971 Q2 0.9119929 1.4473342 0.9015358 6.0513418 -0.1
autoplot(uschange[,c("Consumption","Income")]) +
ylab("% change") + xlab("Year")
uschange %>%
as.data.frame() %>%
ggplot(aes(x=Income, y=Consumption)) +
ylab("Consumption (quarterly % change)") +
xlab("Income (quarterly % change)") +
geom_point() +
geom_smooth(method="lm", se=FALSE)
uschange %>%
as.data.frame() %>%
GGally::ggpairs()
Let us use four variables as predictors of consumption to calculate a multiple linear regression model using the function tslm():
fit.consMR <- tslm(
Consumption ~ Income + Production + Unemployment + Savings,
data=uschange)
summary(fit.consMR)
##
## Call:
## tslm(formula = Consumption ~ Income + Production + Unemployment +
## Savings, data = uschange)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88296 -0.17638 -0.03679 0.15251 1.20553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.26729 0.03721 7.184 1.68e-11 ***
## Income 0.71449 0.04219 16.934 < 2e-16 ***
## Production 0.04589 0.02588 1.773 0.0778 .
## Unemployment -0.20477 0.10550 -1.941 0.0538 .
## Savings -0.04527 0.00278 -16.287 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3286 on 182 degrees of freedom
## Multiple R-squared: 0.754, Adjusted R-squared: 0.7486
## F-statistic: 139.5 on 4 and 182 DF, p-value: < 2.2e-16
We can produce a plot of the predicted values together with the observed data on consumption:
autoplot(uschange[,'Consumption'], series="Data") +
autolayer(fitted(fit.consMR), series="Fitted") +
xlab("Year") + ylab("") +
ggtitle("Percent change in US consumption expenditure") +
guides(colour=guide_legend(title=" "))
It is useful to check the residuals of the regression model:
checkresiduals(fit.consMR)
##
## Breusch-Godfrey test for serial correlation of order up to 8
##
## data: Residuals from Linear regression model
## LM test = 14.874, df = 8, p-value = 0.06163
These two data sets, on Australian air passengers and rice production in Guinea, have a very strong positive correlation:
aussies <- window(ausair, end=2011)
fit <- tslm(aussies ~ guinearice)
summary(fit)
##
## Call:
## tslm(formula = aussies ~ guinearice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9448 -1.8917 -0.3272 1.8620 10.4210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.493 1.203 -6.229 2.25e-07 ***
## guinearice 40.288 1.337 30.135 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.239 on 40 degrees of freedom
## Multiple R-squared: 0.9578, Adjusted R-squared: 0.9568
## F-statistic: 908.1 on 1 and 40 DF, p-value: < 2.2e-16
However, notice that the residuals indicate a strong trend, which violates the assumptions of linear regression.
checkresiduals(fit)
##
## Breusch-Godfrey test for serial correlation of order up to 8
##
## data: Residuals from Linear regression model
## LM test = 28.813, df = 8, p-value = 0.000342
There are a number of fun examples of spurious time series correlations in reference [5].
One can distringuish between true forecasting, termed ex-ante (from before) prediction, in which we truly try to predict the unknown future, and ex-post forecasts, in which the true values both of the predictors and the response variable are known. The latter is still useful for validating models and for comparing different methods.
The library forecast contains tools to make calculating predicted values in time series simple. One can use the model to forecast the values in the future, based on different scenarios. For example, we may want to investigate the prediction for an economic upturn and a downturn:
fit.consBest <- tslm(
Consumption ~ Income + Savings + Unemployment,
data = uschange)
h <- 4
newdata <- data.frame(
Income = c(1, 1, 1, 1),
Savings = c(0.5, 0.5, 0.5, 0.5),
Unemployment = c(0, 0, 0, 0))
fcast.up <- forecast::forecast(fit.consBest, newdata = newdata)
newdata <- data.frame(
Income = rep(-1, h),
Savings = rep(-0.5, h),
Unemployment = rep(0, h))
fcast.down <- forecast::forecast(fit.consBest, newdata = newdata)
# This script does not work for some reason!
autoplot(uschange[, 1]) +
ylab("% change in US consumption") +
autolayer(fcast.up, PI = TRUE, series = "increase") +
autolayer(fcast.down, PI = TRUE, series = "decrease") +
guides(colour = guide_legend(title = "Scenario"))